Data Loading

# libraries
library(tidyverse)
library(DESeq2)
library(vsn)
library(gt)

# setting working directory to assessment data 1
setwd("~/Desktop/MSc Bioinformatics/RNASeq & Transcriptomics/Assignment-1/data analysis")

# Importing data
# Experimental Design
colTable  <- read.csv('Design-Exp.csv', row.names=1, colClasses=c('factor'))

# Gene Counts
gene_countTable <- read.csv("gene_count.csv", row.names=1)

# Transcript Counts
transcript_countTable <- read.csv("transcript_count_matrix.csv", row.names=1)

# sample-specific reads’ statistics
read_stats <- read.csv("reads and alignment stats.csv")

Aim a)

# Read statistics table using gt
gt(read_stats) %>% 
  # Adding column headers
  tab_spanner(label = 'Adapter Trimming', 
              columns = c(2:4)) %>% 
  
  tab_spanner(label = 'Quality Trimming',
              columns = c(5,6)) %>% 
  
  tab_spanner(label = 'Alignment',
              columns = c(7:12)) %>% 
  
  # Remaning cols
  cols_label(Total.Reads='Total Reads',
             Contaminated.Reads = 'Contaminated Reads',
             Contamination.Rate = 'Contamination Rate',
             FASTQ.kept = 'FASTQ kept',
             FASTQ.discarded = 'FASTQ discarded',
             Aligned...0 = 'Aligned = 0',
             Aligned...1 = 'Aligned = 1',
             Aligned...1.... = '(%)',
             Aligned...1.1 = 'Aligned < 1',
             Aligned...1.....1 = '(%)',
             Alignment.rate = 'Alignment Rate (%)'
             ) %>% 
  
  # Formatting (%)
  fmt_percent(columns = c(4,9,11)) %>% 
  
  fmt_number(columns = c(2,3,5:8),decimals = 0) %>% 
  
  # Table styling
  cols_align(align = 'center') %>% 
  
  opt_horizontal_padding(scale = 1) %>% 
  
  opt_vertical_padding(scale = 0.5) %>% 
  
  # Text Font
  tab_style(style = cell_text(font = 'Times',
                              size = 'small'),
            locations = cells_body()) %>% 
  
  tab_style(style = cell_text(font = 'Times',
                              size = 'medium'),
            locations = cells_column_spanners()) %>% 
  
  tab_style(style = cell_text(font = 'Times',
                                size = 'small'),
              locations = cells_column_labels()) %>% 
  
  # Table size
  tab_options(table.width = pct(90))
Sample Adapter Trimming Quality Trimming Alignment
Total Reads Contaminated Reads Contamination Rate FASTQ kept FASTQ discarded Aligned = 0 Aligned = 1 (%) Aligned < 1 (%) Alignment Rate (%)
S1 5,007,025 1,549 0.03% 5,004,799 2,226 82 4,470,224 89.32% 534493 10.68% 100%
S2 4,788,617 1,612 0.03% 4,785,507 3,110 114 4,212,824 88.03% 572569 11.96% 100%
S3 4,271,937 1,448 0.03% 4,270,360 1,577 78 3,816,979 89.38% 453303 10.62% 100%
S4 4,826,596 1,327 0.03% 4,825,174 1,422 115 4,261,802 88.32% 563257 11.67% 100%
S5 4,509,475 1,353 0.03% 4,507,158 2,317 94 3,940,526 87.43% 566538 12.57% 100%
S6 4,779,843 1,040 0.02% 4,777,688 2,155 95 4,174,304 87.37% 603289 12.63% 100%
S7 3,741,884 1,051 0.03% 3,740,762 1,122 97 3,175,404 84.89% 565261 15.11% 100%
S8 7,709,668 2,165 0.03% 7,707,684 1,984 196 6,548,180 84.96% 1159308 15.04% 100%
S9 5,627,869 1,407 0.03% 5,626,337 1,532 104 5,005,223 88.96% 621010 11.04% 100%
S10 5,955,048 1,632 0.03% 5,952,639 2,409 1 5,199,405 87.35% 753121 12.65% 100%
S11 5,020,636 1,164 0.02% 5,019,314 1,322 96 4,478,190 89.22% 541028 10.78% 100%
S12 4,236,576 1,179 0.03% 4,235,721 855 115 3,695,713 87.25% 539893 12.75% 100%

Above in Table 1., we have the sample read statistics from Chromosome 2 of the mouse genome single ended sequencing for all 12 samples. This Tuxedo protocol uses the Sickle, Scythe, Hisat2, Samtools, and Stringtie pipeline. Here we have on average 5 million reads per sample, with a low adapter contamination rate (average of 0.03%). This means that only very few of our reads contain adapter sequence and is well the below the failure rate of 10%.

Reads with a quality below a Phred (+33) score of 10 were trimmed, and only a few thousand of those reads were trimmed for every sample. In terms of the alignment, very few of the reads never aligned (Aligned = 0), the vast majority of reads aligned once against the reference genome (~ 88%). The rest aligned at multiple spots against the reference genome (~12%), these would indicate reads that aligned to multiple repeat regions in the chromosome.

This means that the reads from all samples aligned well against our reference mouse genome for chromosome 2. The quality and coverage of the reads appears to be even between all our samples and therefore we have more confidence in our downstream analyses.

Aim b)

Generation of DESeq2 objects

We will conduct the differential expression (DE) analysis between our groups using the DESeq2 package in R, created by Love, Huber, and Anders (2016). This statistical tool for differential expression for High-Throughput Sequencing (HTS) count data was created to correct for the noise generated by sequencing many genes from low sample size experiments. The tool also corrects for the exaggerated DE log-fold change (LFC) estimates from low count data, the tool creates shrinkage estimates for dispersion and LFC and applies a correction that de-emphasizes the effect and lowers the occurrence of a false positive result. This ultimately generates a DE table that should represent the underlying biology better.

The code below will generate DESeq objects for both the gene and transcript count data and will apply the corrections stated above. It will also create regularised-log (r-log) version of our data to be used later on.

# reordering the count cols since the order was s1, s10, s11, s12, s2, ...
gene_countTable <- gene_countTable[,c(1,5:12,2:4)]
transcript_countTable <- transcript_countTable[,c(1,5:12,2:4)]

# Gene Data
# Creates a DESeq object 
gene_dds <- DESeqDataSetFromMatrix(countData=gene_countTable,
                              colData=colTable,
                              design= ~ group)

# Eliminates rows with a count of 0 for all samples, this will eliminate issues
# later with log2 transformation of the data
NoZero <- (rowSums(counts(gene_dds))>0)
gene_dds <- gene_dds[NoZero,]

# DESEQ analysis
gene_dds <- DESeq(gene_dds)

# Regularised log of gene counts
gene_rld <- rlog(gene_dds, blind = F)

# Transcript Data
# Creates a DESeq object 
transcript_dds <- DESeqDataSetFromMatrix(countData=transcript_countTable,
                              colData=colTable,
                              design= ~ group)

# removing rows with only zeroes
NoZero <- (rowSums(counts(transcript_dds))>0)
transcript_dds <- transcript_dds[-NoZero,]

# DESEQ analysis
transcript_dds <- DESeq(transcript_dds)

# rlog transformed data of transcript dds
trans_rld <- rlog(transcript_dds, blind = F)

Dispersion Estimate Plots

# Gene version
plotDispEsts(gene_dds,
             finalcol='blue4',
             genecol='orange',
             main='Gene count Dispersion plot',
             xlab = 'Mean of Normalised counts',
             ylab = 'Dispersion') 

# Transcript version
plotDispEsts(transcript_dds,
             finalcol='blue4',
             genecol='orange',
             main='Transcript count Dispersion plot',
             xlab = 'Mean of Normalised counts',
             ylab = 'Dispersion')
Figure 1. Dipersion Estimate PlotsFigure 1. Dipersion Estimate Plots

Figure 1. Dipersion Estimate Plots

In Figure 1., we can see the dispersion, this equates to the standard deviation over the mean, plotted against the normalized counts for all genes / transcript in all our samples. The red line represents the expected dispersion for the given count. The curve is a negative binomial distribution and shows that the lower the count, the greater the dispersion. This is why low counts often have greatly exaggerated LFC if not corrected properly. The transcript counts appear to have a greater range in the dispersion around the curve than the gene counts and the dispersion shrinkage appears less pronounced than in the gene counts plot. This could be due to a lot of the transcripts having a dispersion > 10 leading to a lower shrinkage effect.

For the rest of this report, we will focus on the gene count data.

Principal Component Analysis (PCA)

# Gene PCA plot
DESeq2::plotPCA(gene_rld,intgroup=c("group"),ntop=1000) +
  labs(title='Gene Count PCA plot')+
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.6))

# Transcript PCA plot
DESeq2::plotPCA(trans_rld,intgroup=c("group"),ntop=1000) +
  labs(title='Transcript Count PCA plot')+
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.6))

# Transcript PCA (Not shown)
Figure 2. PCA of gene counts

Figure 2. PCA of gene counts

In Figure 2., we can see the similarity between our biological replicates and the groups they reside in. For the first principal component, A and C are more similar to each other than B. For the second prinicipal component, this shows an in-group difference, where samples in group C vary less than in A and B they vary greatly. From this we can hypothesize that we will see a greater DE when contrasting A / C against B as opposed to A against C, where there will be minimal DE.

Standard deviations versus row means Plots

# R-log version
rld_meanSd <- meanSdPlot(assay(gene_rld))$gg +
  labs(title='Gene Count meanSdPlot plot (regularised Log)',
       x='Rank (mean)',
       y='Standard Deviation') +
  ylim(0,3.5) +
  scale_fill_gradient(low = 'orange', high = 'blue4') +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# Log normalised version
log_dd <- varianceStabilizingTransformation(gene_dds,
                                            blind = TRUE,
                                            fitType = "local")

log_meanSd <- meanSdPlot(assay(log_dd))$gg +
  labs(title='Gene Count meanSdPlot plot (Log Normalised)',
       x='Rank (mean)',
       y='Standard Deviation') +
  ylim(0,3.5) +
  scale_fill_gradient(low = 'orange', high = 'blue4') +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# Showing both plot side by side
log_meanSd
rld_meanSd
Figure 3. Standard deviations versus row means PlotsFigure 3. Standard deviations versus row means Plots

Figure 3. Standard deviations versus row means Plots

In Figure 3., where we plot the standard deviation of the transformed data against the mean using r-log and log2 normalised transformations. These plots were made to compare homoscedasticity of both transformations of the data. Our data can be deemed homoscedastic if the average standard deviation is constant across our observations (Rank).

The figure shows that the standard deviation in the r-log plot has a smaller range with less outliers compared to the normalised data. So we can state the r-log data is more homoskedastic.

MA plots

# A vs B
## LFC = 0
AvB.lfc0 <- results(gene_dds,contrast=c("group","A","B"))

## LFC < 1
AvB.lfc1 <- results(gene_dds,contrast=c("group","A","B"), lfcThreshold=1,
                    altHypothesis = 'greaterAbs')

# A vs C
## LFC = 0
AvC.lfc0 <- results(gene_dds,contrast=c("group","A","C"))

## LFC < 1
AvC.lfc1 <- results(gene_dds,contrast=c("group","A","C"), lfcThreshold=1,
                    altHypothesis = 'greaterAbs')

# B vs C
## LFC = 0
BvC.lfc0 <- results(gene_dds,contrast=c("group","B","C"))

## LFC < 1
BvC.lfc1 <- results(gene_dds,contrast=c("group","B","C"), lfcThreshold=1,
                    altHypothesis = 'greaterAbs')
# A vs B
## LFC = 0
DESeq2::plotMA(AvB.lfc0, alpha=0.01,
               main='A-B DE (LFC=0, P<0.01)',
               ylim=c(-10,10), colSig='blue4')

# Labels number of significant genes
text(10000,7, labels = paste0("Significant genes = ",
                              dim(AvB.lfc0 %>% subset(padj < 0.01))[1]))

## LFC = 1
DESeq2::plotMA(AvB.lfc1, alpha=0.01,
               main='A-B DE (LFC<1, P<0.01)',
               ylim=c(-10,10), colSig='blue4')

# Labels number of significant genes
text(10000,7, labels = paste0("Significant genes = ", 
                              dim(AvB.lfc1 %>% subset(padj < 0.01))[1]))

# A vs C
## LFC = 0
DESeq2::plotMA(AvC.lfc0, alpha=0.01,
               main='A-C DE (LFC=0, P<0.01)',
               ylim=c(-10,10), colSig='blue4')

# Labels number of significant genes
text(10000,7, labels = paste0("Significant genes = ",
                              dim(AvC.lfc0 %>% subset(padj < 0.01))[1]))
## LFC = 1
DESeq2::plotMA(AvC.lfc1, alpha=0.01,
               main='A-C DE (LFC<1, P<0.01)',
               ylim=c(-10,10), colSig='blue4')

# Labels number of significant genes
text(10000,7, labels = paste0("Significant genes = ",
                              dim(AvC.lfc1 %>% subset(padj < 0.01))[1]))

# B vs C
## LFC = 0
DESeq2::plotMA(BvC.lfc0, alpha=0.01,
               main='B-C DE (LFC=0, P<0.01)',
               ylim=c(-10,10), colSig='blue4')

# Labels number of significant genes
text(10000,7, labels = paste0("Significant genes = ",
                              dim(BvC.lfc0 %>% subset(padj < 0.01))[1]))
## LFC = 1
DESeq2::plotMA(BvC.lfc1, alpha=0.01,
               main='B-C DE (LFC<1, P<0.01)',
               ylim=c(-10,10), colSig='blue4')

# Labels number of significant genes
text(10000,7,labels = paste0("Significant genes = ",
                             dim(BvC.lfc1 %>% subset(padj < 0.01))[1]))
Figure 4. MA plotsFigure 4. MA plotsFigure 4. MA plotsFigure 4. MA plotsFigure 4. MA plotsFigure 4. MA plots

Figure 4. MA plots

In Figure 5. we are plotting the LFCs against mean of normalised counts from the DE analysis of pairwise group contrasts. The adjusted P value threshold for all contrasts was set at 0.01 and we are comparing the effect of setting LFC=0 and LFC<1 as the null hypothesis. The number of significantly DE genes can be seen in the right-hand corner of the plot. This shows that the group contrasts of A-B and B-C have an equivalent number of DE genes where A-C contrast has very few for both null hypotheses. This aligns with the results from the PCA, where A and C are more closely related than they are to group B.

By setting our null hypothesis to LFC=0, it would mean than any LFC greater than 0 would disprove our null hypothesis (gene is not deferentially expressed) and that our alt-Hypothesis (the gene is deferentially expressed) will be accepted. In a biological context, the idea that two biological groups would be perfectly equal is highly improbable. Therefore, setting the null hypothesis to LFC =0 would mean that a gene with a small LFC that passes the Wald test p.adjusted threshold of 0.01 would be deemed biologically and statistically significant, even if the effect size was small. By setting the null hypothesis to absolute LFC < 1, we can discard these small effect LFCs. In each group contrast, we can see that the number of significant DE genes drops by a factor of 10, when setting this higher threshold.

Shrunken Log-fold change MA plots

library(ashr)
# A vs B shrunken
AB_shrink <- lfcShrink(gene_dds, contrast = c("group","A","B"),
          type = 'ashr', lfcThreshold = 0)

DESeq2::plotMA(AB_shrink, alpha=0.01, main='A-B shrunken LFC DE (LFC=0, P<0.01)',
               ylim=c(-10,10),colSig='blue4')

# Labels number of significant genes
text(10000,7,labels = paste0("Significant genes = ",
                             dim(AB_shrink %>% subset(padj < 0.01))[1]))

# Unshrunken 
DESeq2::plotMA(AvB.lfc0, alpha=0.01, main='A-B DE (LFC=0, P<0.01)',
               ylim=c(-10,10), colSig='blue4')

# Labels number of significant genes
text(10000,7,labels = paste0("Significant genes = ",
                             dim(AvB.lfc0 %>% subset(padj < 0.01))[1]))

# A vs C shruken
AC_shrink <- lfcShrink(gene_dds,contrast = c("group","A","C"),
          type = 'ashr', lfcThreshold = 0)

DESeq2::plotMA(AC_shrink, alpha=0.01, main='A-C shrunken LFC DE (LFC=0, P<0.01)',
               ylim=c(-10,10), colSig='blue4')

text(10000,7,labels = paste0("Significant genes = ",
                             dim(AC_shrink %>% subset(padj < 0.01))[1]))

# Unshrunken 
DESeq2::plotMA(AvC.lfc0,alpha=0.01, main='A-C DE (LFC=0, P<0.01)',
               ylim=c(-10,10),colSig='blue4')

text(10000,7,labels = paste0("Significant genes = ",
                             dim(AvC.lfc0 %>% subset(padj < 0.01))[1]))

# B vs C shruken
BC_shrink <- lfcShrink(gene_dds,contrast = c("group","B","C"),
          type = 'ashr', lfcThreshold = 0)

DESeq2::plotMA(BC_shrink, alpha=0.01, main='B-C shrunken LFC DE (LFC=0, P<0.01)',
               ylim=c(-10,10), colSig='blue4')

text(10000,7,labels = paste0("Significant genes = ",
                             dim(BC_shrink %>% subset(padj < 0.01))[1]))

# Unshrunken 
DESeq2::plotMA(BvC.lfc0, alpha=0.01, main='B-C DE (LFC=0, P<0.01)',
               ylim=c(-10,10), colSig='blue4')

text(10000,7,labels = paste0("Significant genes = ",
                             dim(BvC.lfc0 %>% subset(padj < 0.01))[1]))
Figure 5. Shrunken LFC MA plotsFigure 5. Shrunken LFC MA plotsFigure 5. Shrunken LFC MA plotsFigure 5. Shrunken LFC MA plotsFigure 5. Shrunken LFC MA plotsFigure 5. Shrunken LFC MA plots

Figure 5. Shrunken LFC MA plots

In Figure 5., here we are comparing the normal LFC against shrunken LFC for all DE contrasts using a null hypothesis of LFC =0. The shrunken LFC are like the shrunken dispersion values seen in Figure 1. Here we used the shrinkage estimator ‘ashr’ by Stephens (2016).

Here, the counts from 0 to 10 have had their LFCs shrunken towards 0, whereas genes with counts greater than 10 have only had their LFCs shrunken less. This reduces the chances of detecting a false positive result from those low gene counts with exaggerated DE. In our results, it hasn’t affected the number of significant genes but has reduced the effect size of these genes.

References

Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. https://doi.org/10.1186/s13059-014-0550-8

Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2. https://doi.org/10.1093/biostatistics/kxw041